Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization

نویسندگان

  • Aurelien Bouteiller
  • Franck Cappello
  • Jack J. Dongarra
  • Amina Guermouche
  • Thomas Hérault
  • Yves Robert
چکیده

Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application completion time, rather than to platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, uncoordinated checkpointing per application completion time is unchanged, while it delivers near-perfect platform efficiency.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Multi - criteria checkpointing strategies : optimizing response - time versus resource utilization

Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to generalpurpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application comple...

متن کامل

ICL - UT - 1301 Multi - criteria checkpointing strategies : optimizing response - time versus resource utilization

Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to generalpurpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that has been given to per-application comple...

متن کامل

An Intelligent Algorithm for Optimization of Resource Allocation Problem by Considering Human Error in an Emergency Department

Human error is a significant and ever-growing problem in the healthcare sector. In this study, resource allocation problem is considered along with human errors to optimize utilization of resources in an emergency department. The algorithm is composed of simulation, artificial neural network (ANN), design of experiment (DOE) and fuzzy data envelopment analysis (FDEA). It is a multi-response opt...

متن کامل

An Optimal Utilization of Cloud Resources using Adaptive Back Propagation Neural Network and Multi-Level Priority Queue Scheduling

With the innovation of cloud computing industry lots of services were provided based on different deployment criteria. Nowadays everyone tries to remain connected and demand maximum utilization of resources with minimum timeand effort. Thus, making it an important challenge in cloud computing for optimum utilization of resources. To overcome this issue, many techniques have been proposed ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013